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ABSTRACT Karlin and Altschul in their statistical analysis for multiple high- 
scoring segments in molecular sequences introduced a distribution function which 
gives the probability there are at least r distinct and consistently ordered seg- 
ment pairs all with score at least x. For long sequences this distribution can be 
expressed in terms of the distribution of the length of the longest increasing sub- 
sequence in a random permutation. Within the past few years, this last quantity 
has been extensively studied in the mathematics literature. The purpose of this 
I note is to summarize these new mathematical developments in a form suitable 

\^ ' for use in computational biology. 

Dedicated to Barry McCoy on the occasion of his sixtieth birthday. 

O 

o 

^ ■ 1 The Distribution Function 

Karlin and Altschul in their statistical analysis for multiple high-scoring 
segments in molecular sequences, introduced the following distribution func- 
tion: Let F{r; y) denote the probability that there are at least r distinct and 
consistently ordered segment pairs all with score at least x. They further in- 
^ troduced a parameter y = KNe^^^ where K and A are parameters related 

to the scoring system, see ||^ for details. We use the parameter y with- 
out further reference to x. For long sequences (A^ oo) this distribution 
function is well approximated by p[ 



oo 



F{r-y) = e-yJ2y^,r = 1,2,..., (1.1) 

k=r 

where Rk,r is the number of permutations of the integers {1, . . . ,k} that 
contain an increasing subsequence of length at least r. Let Xy denote a 



AMS Subject classification: Primary 05A16; secondary 60F10, 92D20. 
Keywords and phrases: computational biology, random permutations, increas- 
ing subsequences, Painleve 11. 



2 Craig A. Tracy and Harold Widom 

positive integer valued random variable such that 

Prob {Xy >r) = F{r;y). 

If R^j. denotes the complement of Rk,r, i-e. the number of permutations 
cr ^ Sk all of whose increasing subsequences have length strictly less than 
r, then clearly 

Rl, = #{aeSk:ik{cT)<r} 

= #{aG5fc:4(a)<r-l} 

'■= fk,r-l 

where £k(cr) is the length of the longest increasing subsequence in cr e 5"^. 
Remarks. 

1. F{r; y) is a distribution function in r with parameter y. 

2. Dropping the requirement of consistent ordering has the effect of re- 
placing J. by k\ in Thus the segments are Poisson distributed 
with parameter y. 



2 Summary of Known Properties 

By convention, fo^r '■= 1 for all r and we note that fk^r = k\ if k < r. It is 
also convenient to introduce the parameter t > 0, 



If we define 



then 



y = t'. 



Drit)=j:^, (2.1) 

k=0 



k=r 



e-y 



k\ ^ 

k=r k=r 
k=r \ k=0 



l-e-yDr^,{^). (2.2) 
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From Gessel we know that Dr{t) is the r x r Toeplitz determinant 
with symbol 



In the past few years, Dr has been extensively studied in connection with 
the limiting distribution of the length of the longest increasing subsequence 
of a random permutation, see Baik, Deift and Johansson and Aldous 
and Diaconis [§, |[. We now summarize some of the these results. Gessel's 
theorem says that for all r = 1, 2, . . . 



Dr{t) = det 



0<i,j<r-l 



where bj := Ij{2t) and Ij is the modified Bessel function. For small r one 
simply evaluates this determinant to obtain 

F{l;y) = l-e-^ 

F{2;y) = 1 - 6o e'^ 

F{3;y) = I - {bl - bj) , 

F{4- y) = l- {bl + 2blb, - 2blbo - e"^- 

From (|2.2| ) we see that 

My) ■= e-^Dri^) = Prob {Xy < r) . 

Johansson has shown that for any given e > 0, there exist C and 6 > 
such that 

0<My)<Ce-'y a il + e)r<2^, 
0<1-My)<j if {l-e)r>2^. 

The breakthrough result of Baik-Deift- Johansson is the sharper asymp- 
totic result 

lim M/y+sv^/'^iy) = ^2(5) (2.3) 



y->oo 



where F2 is the distribution function, first discovered by the present authors 
in the context of random matrix theory [11| (see [|13| for a review). 



F2{s) = exp ^- 



x-s)q{xYdx] (2.4) 



and q is the solution of the Painleve II equation 

q" = sq + 2q^ (2.5) 
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satisfying q{s) ~ Ai(s) as s ^ oo. (Here Ai is the Airy function.) It is 
known that such a solution to (|2.5|) exists and is unique. A graph of the 
density dF2/ds as well as some statistics of F2 can be found in |T^. In terms 
of the random variable Xy this says 



converges weakly to a random variable, call it with distribution function 
F2. It was also proved that the scaled moments converge to the moments 
oiF2 i. 

For finite r we now describe some results of Periwal and Shevitz [§, 
Hisakado , Tracy and Widom |12[ , and Adler and van Moerbeke [0] . (We 



follow the notation of |T^.) We have the representation 

0,(|/) = exp(^-4^ log(t/r)r(l-$,(r))rfr^ , y = t', (2.6) 
where as a function of t satisfies the equation 

We want the solution $r that satisfies 

^2r 



$. = 1-7^ + 0(^2'-+^), t-^O. (2.^ 



Setting 



:= 1 - 



we have the recursion relation, sometimes referred to as the discrete Painleve 
II equation, 

LUr + {l-U^){Ur-l + Ur-,l)=0, r = 1, 2, . . . . (2.9) 

The initial conditions for this recursion relation can be obtained from 0o = 
e~y and 0i = boe~y. A computation show^ 

U - I U - ^'^^'^ 



^The signs of Uq and Ui are not fixed from (/jq and (f)i. In [I2| the leading 
small t behavior of Un is computed. We use this to fix the signs of Uq and Ui. 



1. On a Distribution Function Arising in Computational Biology 5 



To make computational use of this distribution function, one needs com- 
putationally feasible formulas for the first few moments of Xy for all y; 
and more generally, the distribution function itself. Here are some partial 
results. Of course. 



oo 



E(Xj,) = ^r(0,,-0,_i). (2.10) 

r=l 

From (|2]|) and it follows that 



and thus 



R 

E(X,) = J]r(0,,-0,,_i) + O(i/«+^) 

r=l 



Using the recursion relation ( |2.9| ) we can compute the first few [/„'s and 
expand these for small y. In this way we derive 

and similarly for the variance 

Var(X,) =y-^-y^ + }ly^-^^y^ + O(y^). (2.12) 
Note that the leading order terms are Poisson. Higher order expansion co- 



efficients are given in Table ^ 



From we know that for large y, the essential contribution in ( |2.10D 
comes from r around 2y^. Thus for y ^ oo 

E{Xy) = 2^ + y'/'Eix) + o{y'/') (2.13) 

= 2^^- 1.77109 ?/^/^ + o(y^/^) (2.14) 



where x has distribution function ( |2.4|) . We note for future reference, 

Var(x) ~ 0.8132. 

The small y expansion of E{Xy) was computed through order 20. If we 
demand that the last coefficient in this expansion be less than, say, 1/10, 
then y < 7.8. Evaluating this expansion at y = 7.8 gives E{Xy) = 3.66 
whereas the large y expansion evaluated at y = 7.8 equals 3.09 which is 
a difference of 0.57. To improve the overlap of these two expansions, one 
needs to compute the error term in ( p.l3| ). 
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X 


y 


E(X,) 


Var(Xj 


7.6 


536.8 


41.3 


6.6 


6.7 


712.1 


48.1 


7.3 


5.8 


944.6 


55.9 


8.0 



TABLE 1.1. For Karlin-Altschul parameters A = 0.314, K = 0.17 and 
= 34336, and three values of the normalized score x, the expected value and 
variance of Xy are computed using the large y expansions. 



3 An Example 

Karlin and Altschul give the parameters in their theory for the pairwise 
sequence comparison of the chicken gene X protein and the fowlpox virus 
antithrombin III homolog. The scoring system gives A = 0.314, K = 0.17 
and = 34336. For the three alignments found (see Table 3 in the 
normalized scores (values of x) are 7.6, 6.7 and 5.8. Using the above distri- 
bution function we compute the expected number of distinct consistently 
ordered seqment pairs with at least normalized score x. The results are 



displayed in Table 1.1 
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232 314 56 74112132172 


232 314 56 74 112 132 172 


19 


1077161-39636029 


61 - 83 - 709 - 7309 - 37338914351 


23I3I554 7411213217219 


23I3I556 7411213217219 


20 


229-5189-247913-1229957 


239 - 1181 - 2161 - 263188412702251 


235 315 58 74 112 132 172 192 


235 315 57 74 112 132 172 192 



TABLE 1.2. The number c], is the coefficient of y"^ in the small y expansion of 
E(Xy) and is the coefficient of y'^ in the small y expansion of Var(Xy). 
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